Learning policies from fixed offline datasets is a key challenge in scaling up reinforcement learning (RL) algorithms towards practical applications. This is largely because off-policy RL algorithms suffer from distributional shift, due to the mismatch between the dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double-sampling issues when computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm. We show that the regularizer leads to a lower bound on the offline policy optimization objective, which helps avoid over-estimation errors and explains the benefits of our approach across a range of continuous control domains compared to existing state-of-the-art algorithms.
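As a rough illustration of how a variance term can be optimized without double sampling, below is a minimal NumPy sketch of the standard Fenchel-dual (auxiliary-variable) form of variance, Var[X] = min_ν E[(X − ν)²]; the data and step sizes are illustrative assumptions, not the authors' OVAR implementation.

```python
# Minimal sketch (not the paper's code): variance via its Fenchel-dual form,
# Var[X] = min_nu E[(X - nu)^2]. With nu treated as a learnable dual variable,
# the gradient of the regularizer only needs single samples of X.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for per-sample weighted returns under stationary distribution corrections.
returns = rng.normal(loc=1.0, scale=0.5, size=1024)

nu = 0.0                                   # dual variable
for _ in range(200):                       # plain gradient descent on nu
    grad_nu = -2.0 * np.mean(returns - nu)
    nu -= 0.05 * grad_nu

dual_estimate = np.mean((returns - nu) ** 2)
print("dual estimate :", round(dual_estimate, 4))
print("sample variance:", round(returns.var(), 4))
```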
Several self-supervised representation learning methods have been proposed for reinforcement learning (RL) with rich observations. For real-world applications of RL, recovering underlying latent states is crucial, particularly when sensory inputs contain irrelevant and exogenous information. In this work, we study how information bottlenecks can be used to construct latent states efficiently in the presence of task-irrelevant information. We propose architectures that utilize variational and discrete information bottlenecks, coined RepDIB, to learn structured, factorized representations. Exploiting the expressiveness brought by factorized representations, we introduce a simple yet effective bottleneck that can be integrated with any existing self-supervised objective for RL. We demonstrate this across several online and offline RL benchmarks, along with a real robot arm task, and find that compressed representations with RepDIB lead to strong performance improvements, as the learned bottlenecks help predict only the relevant state while ignoring irrelevant information.
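For concreteness, here is a minimal PyTorch sketch of a variational information bottleneck layer of the kind such architectures build on; the dimensions, module name, and KL weight are illustrative assumptions, not the RepDIB implementation.

```python
# Minimal sketch (illustrative, not the RepDIB code): a variational information
# bottleneck layer whose KL term compresses the encoder output.
import torch
import torch.nn as nn

class VariationalBottleneck(nn.Module):
    def __init__(self, in_dim=64, code_dim=16):
        super().__init__()
        self.mu = nn.Linear(in_dim, code_dim)
        self.log_var = nn.Linear(in_dim, code_dim)

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization
        # KL(q(z|h) || N(0, I)) acts as the compression pressure.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1).mean()
        return z, kl

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
bottleneck = VariationalBottleneck()

obs = torch.randn(8, 32)                   # stand-in for an observation batch
z, kl = bottleneck(encoder(obs))
aux_loss = 1e-3 * kl                       # added to any self-supervised objective
print(z.shape, float(aux_loss))
```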
Deep reinforcement learning (RL) is a powerful framework for solving complex real-world problems. Large neural networks employed within the framework are traditionally associated with better generalization capabilities, but their increased size comes at the cost of extensive training durations, substantial hardware resources, and longer inference times. One way to tackle this problem is to prune neural networks, leaving only the necessary parameters. State-of-the-art concurrent pruning techniques for imposing sparsity are used in applications with a fixed data distribution; however, they have not yet been substantially explored in the context of RL. We close the gap between RL and single-shot pruning techniques and present a general pruning approach for offline RL. We leverage a fixed dataset to prune neural networks before the start of RL training. We then run experiments at different network sparsity levels and evaluate the validity of pruning-at-initialization techniques in continuous control tasks. Our results show that with 95% of the network weights pruned, offline RL algorithms can still retain their performance in the majority of our experiments. To the best of our knowledge, no prior work has utilized pruning in RL while retaining performance at such high levels of sparsity. Moreover, pruning-at-initialization techniques can be easily integrated into any existing offline RL algorithm without changing the learning objective.
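A minimal sketch of one-shot pruning before training is given below, using a simple global magnitude criterion at 95% sparsity; the paper evaluates pruning-at-initialization criteria computed on the fixed offline dataset, so this is only an illustration of the workflow, not the exact method.

```python
# Minimal sketch (illustrative): one-shot global magnitude pruning at initialization,
# keeping the largest 5% of weights by absolute value.
import torch
import torch.nn as nn

def one_shot_prune(model, sparsity=0.95):
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(weights, sparsity)
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:
            masks[name] = (p.detach().abs() > threshold).float()
            p.data.mul_(masks[name])        # zero out pruned weights
    return masks                            # reapply these masks after every update

policy = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6))
masks = one_shot_prune(policy, sparsity=0.95)
kept = int(sum(m.sum().item() for m in masks.values()))
total = sum(m.numel() for m in masks.values())
print(f"remaining weights: {kept}/{total} ({kept / total:.1%})")
```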
We hypothesize that empirically studying the sample complexity of offline reinforcement learning (RL) is crucial for the practical application of RL in the real world. Several recent works have demonstrated the ability to learn policies directly from offline data. In this work, we ask how the quality of learning from offline data depends on the number of samples. Our objective is to emphasize that studying the sample complexity of offline RL is important and is a useful metric of how useful existing offline algorithms are. We propose an evaluation approach for the sample complexity analysis of offline RL.
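One plausible instantiation of such an evaluation is to train the same offline RL algorithm on nested subsets of a fixed dataset and report performance as a function of subset size; the sketch below only illustrates that sweep, with a placeholder standing in for the actual training and evaluation, and is not the paper's protocol.

```python
# Illustrative sketch only: performance vs. number of offline samples.
import numpy as np

def train_and_evaluate(dataset):
    # Placeholder standing in for offline training plus policy evaluation.
    return float(np.mean(dataset))

rng = np.random.default_rng(0)
full_dataset = rng.normal(size=100_000)     # stand-in for logged transitions

for fraction in (0.01, 0.05, 0.1, 0.5, 1.0):
    n = int(len(full_dataset) * fraction)
    score = train_and_evaluate(full_dataset[:n])   # nested subsets keep runs comparable
    print(f"{n:>7} samples -> score {score:.3f}")
```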
Accurate recognition of food items along with quality assessment is of paramount importance in the agricultural industry. Such automated systems can speed up the food processing sector and save a great deal of manual labor. In this connection, recent advances in deep learning-based architectures have introduced a wide variety of solutions offering remarkable performance on several classification tasks. In this work, we exploit the concept of Densely Connected Convolutional Neural Networks (DenseNets) for fruit quality assessment. Feature propagation towards the deeper layers enables the network to tackle vanishing gradient problems and ensures the reuse of features to learn meaningful representations. Evaluated on a dataset of 19,526 images covering six fruits with three quality grades each, the proposed pipeline achieved a remarkable accuracy of 99.67%. The robustness of the model was further tested on the fruit classification and quality assessment tasks separately, where it produced similar performance, making it suitable for real-life applications.
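A minimal PyTorch/torchvision sketch of such a pipeline is shown below; the 18-way head (six fruits times three quality grades), the random batch, and the hyperparameters are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch (illustrative): DenseNet-121 with an 18-way head and one training
# step on random tensors standing in for fruit images. ImageNet weights can be
# loaded by passing torchvision's pretrained weights instead of None.
import torch
import torch.nn as nn
from torchvision import models

model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, 18)  # 6 fruits x 3 grades

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)        # stand-in for a batch of fruit images
labels = torch.randint(0, 18, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")
```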
We consider a multi-agent episodic MDP setup where an agent (the leader) takes an action at each step of the episode, followed by another agent (the follower). The state evolution and rewards depend on the joint action pair of the leader and the follower. Such interactions find applications in many domains, such as smart grids, mechanism design, security, and policymaking. We are interested in how to learn policies for both players with provable performance guarantees under a bandit feedback setting. We focus on a setup where both the leader and the follower are {\em non-myopic}, i.e., they both seek to maximize their rewards over the entire episode, and we consider a linear MDP, which can model continuous state spaces and is very common in many RL applications. We propose a {\em model-free} RL algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both the leader and the follower, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps under the bandit feedback information setup. Thus, our result holds even when the number of states becomes infinite. The algorithm relies on a {\em novel} adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard greedy policy (as the best response) with a soft-max policy for both the leader and the follower. This turns out to be key in establishing uniform concentration bounds for the value functions. To the best of our knowledge, this is the first sub-linear regret bound guarantee for Markov games with non-myopic followers and function approximation.
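The toy sketch below contrasts the greedy best response with a soft-max best response on illustrative Q-values (the values and temperature are assumptions, not from the paper); it shows why the soft-max response varies continuously with the Q-values, which is the property uniform concentration arguments rely on.

```python
# Toy illustration: greedy vs. soft-max best response to a follower's Q-values.
import numpy as np

def greedy(q):
    policy = np.zeros_like(q)
    policy[np.argmax(q)] = 1.0
    return policy

def soft_max(q, eta=10.0):
    z = np.exp(eta * (q - q.max()))          # numerically stable soft-max
    return z / z.sum()

q_follower = np.array([1.00, 1.01, 0.20])    # two nearly-tied actions
print("greedy  :", greedy(q_follower))
print("soft-max:", np.round(soft_max(q_follower), 3))
```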
Handling and digesting a huge amount of information in an efficient manner has been a long-standing demand in modern society. Some solutions for mapping key points (short textual summaries capturing essential information and filtering redundancies) to a large number of arguments/opinions have been provided recently (Bar-Haim et al., 2020). To complete the picture of the argument-to-keypoint mapping task, we propose two main approaches in this paper. The first approach incorporates prompt engineering for fine-tuning pre-trained language models (PLMs). The second approach utilizes prompt-based learning in PLMs to generate intermediary texts, which are then combined with the original argument-keypoint pairs and fed as inputs to a classifier that maps them. Furthermore, we extend the experiments to cross-domain and in-domain settings to conduct an in-depth analysis. In our evaluation, we find that i) using prompt engineering in a more direct way (Approach 1) can yield promising results and improve performance; ii) Approach 2 performs considerably worse than Approach 1 due to the negation issue of the PLM.
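As a rough sketch of Approach 2's input construction, the snippet below concatenates an argument, an intermediary text (which would come from a prompted PLM), and a key point, and scores the pair with a sequence classifier; the model name, example texts, and formatting are illustrative assumptions, and the classification head here is untrained, so the score is meaningless until fine-tuned.

```python
# Rough sketch (illustrative): score an argument-keypoint pair, augmented with an
# intermediary text, using a sequence-pair classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

argument = "School uniforms limit students' self-expression."
intermediary = "The argument claims uniforms suppress individuality."  # from a prompted PLM
keypoint = "Uniforms restrict personal freedom."

inputs = tokenizer(f"{argument} {intermediary}", keypoint,
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = classifier(**inputs).logits
print(f"P(match) = {torch.softmax(logits, dim=-1)[0, 1].item():.2f}")
```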
The extent to which men and women use language differently has long been questioned. Findings of clear and consistent gender differences in language are inconclusive in general, and the research is heavily influenced by the context and the method employed to identify the differences. In addition, the majority of prior research was conducted on written language, with samples collected in written form. Therefore, we compared the word choices of male and female presenters in public addresses such as TED lectures. The frequencies of numerous types of words, such as parts of speech (POS) and linguistic, psychological, and cognitive terms, were analyzed statistically to determine how male and female speakers use words differently. Based on our data, we determined that male speakers use specific types of linguistic, psychological, cognitive, and social words with considerably greater frequency than female speakers.
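A minimal sketch of the kind of per-category frequency comparison described is shown below; the toy transcripts, the tiny "social words" lexicon, and the use of a chi-square test are illustrative assumptions, not the study's materials or exact statistics.

```python
# Illustrative sketch only: compare the frequency of a word category across two
# speaker groups and test the difference.
from collections import Counter
from scipy.stats import chi2_contingency

social_words = {"friend", "family", "we", "together", "community"}

male_tokens = "we built this together with every friend and partner we had".split()
female_tokens = "the data show a clear pattern across each measured variable".split()

def category_counts(tokens, lexicon):
    counts = Counter(tokens)
    in_category = sum(counts[w] for w in lexicon)
    return in_category, len(tokens) - in_category

table = [category_counts(male_tokens, social_words),
         category_counts(female_tokens, social_words)]
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```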
Community detection is a classic problem in network science with extensive applications in various fields. The most commonly used methods are algorithms designed to maximize a utility function, modularity, across different ways of partitioning a network into communities. Despite their name and design philosophy, current modularity maximization algorithms often fail to maximize modularity or guarantee any proximity to an optimal solution. We propose the Bayan algorithm, which, unlike existing methods, returns network partitions that are guaranteed to be optimal or close to an optimal solution. At the core of the Bayan algorithm is a branch-and-cut scheme that solves a sparse integer programming formulation of the modularity maximization problem to optimality, or approximates it within a factor. We analyze the performance of Bayan against 22 existing algorithms using synthetic and real networks. Through extensive experiments, we demonstrate Bayan's distinctive capability not only in maximizing modularity, but more importantly in accurately retrieving ground-truth communities. Bayan's comparative level of performance remains stable over variations in the amount of noise in the data (graph) generation process. The performance of Bayan as an exact modularity maximization algorithm also reveals the theoretical limits of maximum-modularity partitions in accurately retrieving communities. Overall, our analysis points to Bayan as a suitable choice for a methodologically grounded detection of communities through exact (approximate) maximization of modularity in networks with up to $\sim 10^3$ edges (and larger networks). Prospective advances in graph optimization and integer programming can push these limits further.
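To make the objective concrete, the sketch below maximizes modularity exactly by brute force over all two-community assignments of a tiny graph using networkx; Bayan instead solves a sparse integer programming formulation with branch-and-cut, which is what makes exact or factor-approximate solutions feasible at larger scales.

```python
# Illustrative sketch: exact modularity maximization by exhaustive search on a tiny graph.
from itertools import product
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.barbell_graph(3, 0)                  # two triangles joined by a single edge
nodes = list(G.nodes)

best_q, best_partition = float("-inf"), None
for labels in product(range(2), repeat=len(nodes)):
    groups = [{n for n, c in zip(nodes, labels) if c == k} for k in range(2)]
    groups = [g for g in groups if g]       # drop empty communities
    q = modularity(G, groups)
    if q > best_q:
        best_q, best_partition = q, groups

print("max modularity:", round(best_q, 3), "partition:", best_partition)
```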
This paper introduces Sketched Reality, an approach that combines AR sketching and actuated tangible user interfaces (TUIs) for bidirectional sketching interaction. Bidirectional sketching enables virtual sketches and physical objects to influence each other through physical actuation and digital computation. In existing AR sketching, the relationship between the virtual and physical worlds is only one-directional: while physical interaction can affect virtual sketches, virtual sketches have no returning effect on physical objects or the environment. In contrast, bidirectional sketching interaction allows seamless coupling between sketches and actuated TUIs. In this paper, we demonstrate the concept with tabletop-size small robots (Sony Toio) and an iPad-based AR sketching tool. In our system, virtual sketches drawn and simulated on the iPad (e.g., lines, walls, pendulums, and springs) can move, actuate, collide with, and constrain physical Toio robots, as if the virtual sketches and physical objects existed in the same space, through seamless coupling between AR and robot motion. This paper contributes a set of novel interactions and a design space of bidirectional AR sketching. We demonstrate a series of potential applications, such as tangible physics education, explorable mechanisms, tangible games for children, and in-situ robot programming via sketching.